In this notebook, we observe the annotation results obtained from cellassign https://github.com/Irrationone/cellassign Those results were obtained in R in the R markdown named: cellassign.Rmd/cellassign.html
We predict labels for the PBMC3k datasets using:
PMBC3k predition by cellassign is compared with the sig-annot and auto-annot prediction.
import besca as bc
import scanpy as sc
import pandas as pd
import logging
sc.logging.print_versions()
WARNING: If you miss a compact list, please try `print_header`! ----- anndata 0.7.5 scanpy 1.6.0 sinfo 0.3.1 ----- PIL 8.1.0 anndata 0.7.5 annoy NA attr 20.3.0 backcall 0.2.0 bbknn NA besca 2.3+36.g8e736be biothings_client 0.2.1 cached_property 1.5.2 certifi 2020.12.05 cffi 1.14.4 chardet 3.0.4 colorama 0.4.4 cycler 0.10.0 cython_runtime NA dateutil 2.8.1 decorator 4.4.2 dominate 2.6.0 fbpca NA get_version 2.1 greenlet 0.4.17 h5py 3.1.0 idna 2.7 igraph 0.8.3 importlib_metadata 1.1.0 iniconfig NA intervaltree NA ipykernel 5.4.3 ipython_genutils 0.2.0 ipywidgets 7.6.3 jedi 0.18.0 joblib 0.14.0 kiwisolver 1.3.1 legacy_api_wrap 1.2 leidenalg 0.8.3 llvmlite 0.35.0 lxml 4.6.2 matplotlib 3.3.3 more_itertools 8.6.0 mpl_toolkits NA mygene 3.1.0 natsort 6.2.0 networkx 2.5 nose 1.3.7 numba 0.52.0 numexpr 2.7.2 numpy 1.19.5 packaging 20.8 pandas 1.2.0 parso 0.8.1 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA plotly 4.14.3 pluggy 0.13.1 prompt_toolkit 3.0.10 ptyprocess 0.7.0 py 1.10.0 pycparser 2.20 pygments 2.7.4 pyparsing 2.4.7 pytest 6.2.1 pytz 2020.5 requests 2.19.1 requests_cache 0.5.2 scanorama 1.7 scanpy 1.6.0 scipy 1.5.1 scvelo 0.2.2 seaborn 0.11.1 setuptools_scm NA sinfo 0.3.1 six 1.15.0 sklearn 0.21.3 sortedcontainers 2.3.0 statsmodels 0.10.2 storemagic NA tables 3.6.1 texttable 1.6.3 tornado 6.1 traitlets 5.0.5 umap 0.3.10 urllib3 1.23 wcwidth 0.2.5 yaml 5.1.2 zipp NA zmq 20.0.0 zope NA ----- IPython 7.19.0 jupyter_client 6.1.11 jupyter_core 4.7.0 notebook 6.1.6 ----- Python 3.7.1 | packaged by conda-forge | (default, Mar 13 2019, 12:57:14) [GCC 7.3.0] Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core 24 logical CPU cores, x86_64 ----- Session information updated at 2021-01-14 11:14
We load the dataset predicted with Auto-Annot
pbmc3K_prediction = sc.read_h5ad("adata_predicted_16072020.h5ad")
In this pbm3k data are stored multiples scores and information. Leiden is the leiden clustering on the whole datasets post classical filtering. dblabel is the results of the sig-annot procedure on the whole dataset also after reclustering around T-cells/NK-cells for a finer grain annotation. Finally auto_annot is the results of the auto-annot procedure using Granja and Kotliarov datasets as training sets.
sc.pl.umap(pbmc3K_prediction, color = ["leiden", "dblabel", "auto_annot"], ncols = 1)
#bc.tl.auto_annot.report(pbmc3K_prediction, celltype = 'dblabel', method = 'autoannot', analysis_name= 'autoannot',
# train_datasets='pmbc3k', test_dataset='GranjaAndKV', merge = True, use_raw=False)
We load the prediction obtained with cellassign and upload the said results in our h5ad object.
cellassign predictions were obtained in R, please see cellassign.html / cellassign.Rmd
read_bescaF_prediction= pd.read_csv("pbmc3k_filtered_cellassign_bescaMarkersF.csv", header= None)
read_bescaF_prediction.index=list(read_bescaF_prediction[0])
read_bescaF_prediction=read_bescaF_prediction.drop(columns=0)
read_bescaF_prediction.rename(columns={1: 'bescaF_label'},
inplace=True)
read_bescaFv2_prediction= pd.read_csv("pbmc3k_filtered_cellassign_bescaFv2.csv", header= None)
read_bescaFv2_prediction.index=list(read_bescaFv2_prediction[0])
read_bescaFv2_prediction=read_bescaFv2_prediction.drop(columns=0)
read_bescaFv2_prediction.rename(columns={1: 'bescaF_label'},
inplace=True)
read_bescaVvs_prediction= pd.read_csv("pbmc3k_filtered_cellassign_bescaVvshort.csv", header= None)
read_bescaVvs_prediction.index=list(read_bescaVvs_prediction[0])
read_bescaVvs_prediction=read_bescaVvs_prediction.drop(columns=0)
read_bescaVvs_prediction.rename(columns={1: 'bescaF_label'},
inplace=True)
read_cellassign_prediction= pd.read_csv("pbmc3k_filtered_cellassign.csv", header= None)
read_cellassign_prediction.index=list(read_cellassign_prediction[0])
read_cellassign_prediction=read_cellassign_prediction.drop(columns=0)
read_cellassign_prediction.rename(columns={1: 'cellassign_label'},
inplace=True)
bescaF_probabilities=pd.read_csv("pbmc3k_filtered_cellassign_bescaMarkersF_probabilities.csv")
bescaFv2_probabilities=pd.read_csv("pbmc3k_filtered_cellassign_bescaMarkersF_probabilities.csv")
bescaVvs_probabilities=pd.read_csv("pbmc3k_filtered_cellassign_bescaVvshort_probabilities.csv")
We need to map back the labels to dblabels using the dblabel nomenclature (described in besca in CellTypes_v1.tsv) when using the singlecell datasets provided by cellassign (in the celldex package)
set(read_cellassign_prediction['cellassign_label'])
{'B cells',
'Cytotoxic T cells',
'Epithelial cells',
'Monocyte/Macrophage',
'T cells',
'Vascular smooth muscle cells',
'unassigned'}
dblabel_mapping = {'B cells': 'B cell',
'Cytotoxic T cells': 'CD8-positive, alpha-beta cytotoxic T cell',
'Epithelial cells': 'epithelial cell',
'Monocyte/Macrophage': 'myeloid cell',
'T cells': 'T cell',
'Vascular smooth muscle cells': 'smooth muscle cell',
'unassigned': 'unassigned'}
read_cellassign_prediction['dblabel_cellassign'] = read_cellassign_prediction.cellassign_label.copy().map(dblabel_mapping)
bescapath='/pstore/home/schwalip/src/besca_new/besca/'
### transform these short forms to dblabel - EFO standard nomenclature
nomenclature=pd.read_csv(bescapath+'/besca/datasets/nomenclature/CellTypes_v1.tsv',sep='\t',header=0,skiprows=range(1, 2))
read_bescaF_prediction["bescaF_label"].replace(to_replace='unassigned', value="Cell", inplace=True)
read_bescaF_prediction["bescaF_label"].replace(to_replace='Prolif', value="ProlifMyeloid", inplace=True)
read_bescaVvs_prediction["bescaF_label"].replace(to_replace='unassigned', value="Cell", inplace=True)
read_bescaVvs_prediction["bescaF_label"].replace(to_replace='Prolif', value="ProlifMyeloid", inplace=True)
read_bescaFv2_prediction["bescaF_label"].replace(to_replace='unassigned', value="Cell", inplace=True)
read_bescaFv2_prediction["bescaF_label"].replace(to_replace='Prolif', value="ProlifMyeloid", inplace=True)
read_bescaF_prediction["bescaF_label"]=[list(nomenclature.loc[nomenclature['short_dblabel']==x,'dblabel'])[0] for x in list(read_bescaF_prediction["bescaF_label"])]
read_bescaFv2_prediction["bescaF_label"]=[list(nomenclature.loc[nomenclature['short_dblabel']==x,'dblabel'])[0] for x in list(read_bescaFv2_prediction["bescaF_label"])]
read_bescaVvs_prediction["bescaF_label"]=[list(nomenclature.loc[nomenclature['short_dblabel']==x,'dblabel'])[0] for x in list(read_bescaVvs_prediction["bescaF_label"])]
read_bescaF_prediction["bescaF_label"].replace(to_replace='animal cell', value="unassigned", inplace=True)
read_bescaF_prediction["bescaF_label"].replace(to_replace='proliferating myeloid leukocyte', value="proliferating hematopoietic cell", inplace=True)
read_bescaFv2_prediction["bescaF_label"].replace(to_replace='animal cell', value="unassigned", inplace=True)
read_bescaFv2_prediction["bescaF_label"].replace(to_replace='proliferating myeloid leukocyte', value="proliferating hematopoietic cell", inplace=True)
read_bescaVvs_prediction["bescaF_label"].replace(to_replace='animal cell', value="unassigned", inplace=True)
read_bescaVvs_prediction["bescaF_label"].replace(to_replace='proliferating myeloid leukocyte', value="proliferating hematopoietic cell", inplace=True)
read_bescaVvs_prediction["bescaVvs_label"]=read_bescaVvs_prediction["bescaF_label"].copy()
read_bescaFv2_prediction["bescaFv2_label"]=read_bescaFv2_prediction["bescaF_label"].copy()
obs_bis = pbmc3K_prediction.obs.join(read_bescaF_prediction["bescaF_label"], how = "left")
obs_bis = obs_bis.join(read_bescaVvs_prediction["bescaVvs_label"], how = "left")
obs_bis = obs_bis.join(read_bescaFv2_prediction["bescaFv2_label"], how = "left")
obs_bis = obs_bis.join(read_cellassign_prediction["cellassign_label"], how = "left")
obs_bis = obs_bis.join(read_cellassign_prediction["dblabel_cellassign"], how = "left")
pbmc3K_prediction.obs = obs_bis.copy()
for i in list(bescaF_probabilities.columns):
pbmc3K_prediction.obs['P_'+i]=bescaF_probabilities[i]
for i in list(bescaVvs_probabilities.columns):
pbmc3K_prediction.obs['PVvs_'+i]=bescaVvs_probabilities[i]
for i in list(bescaFv2_probabilities.columns):
pbmc3K_prediction.obs['PFv2_'+i]=bescaFv2_probabilities[i]
sc.pl.umap(pbmc3K_prediction,color=['P_ClassMonocyte','P_cDC','P_pDC','P_CytotoxCD8Tcell','P_NaiTcell',
'P_CD56dimNK','P_CD56brightNK','P_Mast'])
... storing 'bescaF_label' as categorical ... storing 'bescaVvs_label' as categorical ... storing 'bescaFv2_label' as categorical ... storing 'cellassign_label' as categorical ... storing 'dblabel_cellassign' as categorical
sc.pl.umap(pbmc3K_prediction,color=['PFv2_ClassMonocyte','PFv2_cDC','PFv2_pDC','PFv2_CytotoxCD8Tcell','PFv2_NaiTcell',
'PFv2_CD56dimNK','PFv2_CD56brightNK','PFv2_NClassMonocyte'])
sc.pl.umap(pbmc3K_prediction,color=['PVvs_ClassMonocyte','PVvs_cDC','PVvs_pDC','PVvs_Bcell','PVvs_NKcell',
'PVvs_Tcell','PVvs_NClassMonocyte'])
sc.pl.umap( pbmc3K_prediction, color = ["dblabel",'auto_annot' , 'bescaF_label','bescaFv2_label','bescaVvs_label',
'cellassign_label','dblabel_cellassign'], ncols= 1)
#sc.pl.umap( pbmc3K_prediction, color = [ 'dblabel', "auto_annot", "KV_dblabel","Granja_dblabel","KVandGranja_dblabel" ], ncols= 2)
We fix the palette to keep the same colors for the UMAPs celltypes
values_ = set( pd.Categorical(pbmc3K_prediction.obs.dblabel).categories.tolist()+\
pd.Categorical(pbmc3K_prediction.obs.bescaF_label).categories.tolist()+\
pd.Categorical(pbmc3K_prediction.obs.bescaFv2_label).categories.tolist()+\
pd.Categorical(pbmc3K_prediction.obs.bescaVvs_label).categories.tolist()+\
pd.Categorical(pbmc3K_prediction.obs.dblabel_cellassign).categories.tolist()+\
pd.Categorical(pbmc3K_prediction.obs.auto_annot).categories.tolist())
import seaborn as sns
#palette_ = sns.color_palette("tab10", len( values_) ) #
palette_ = sns.color_palette(n_colors=len( values_) ) #n_colors=30
palette_colors = dict(zip(values_, palette_))
#palette_colors
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'dblabel', checkColors = True)
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'auto_annot', checkColors = True)
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'bescaF_label', checkColors = True)
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'bescaFv2_label', checkColors = True)
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'bescaVvs_label', checkColors = True)
bc.pl.update_qualitative_palette( pbmc3K_prediction, palette_colors, group = 'dblabel_cellassign', checkColors = True)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'dblabel'], ncols= 2)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'auto_annot'], ncols= 2)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'bescaF_label'], ncols= 2)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'bescaFv2_label'], ncols= 2)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'bescaVvs_label'], ncols= 2)
sc.pl.umap( pbmc3K_prediction, color = ["leiden", 'dblabel_cellassign'], ncols= 2)
#annot_comparison in adata_pred.obs_keys()
adata_pred = pbmc3K_prediction
method = "cellassign"
merge = False
remove_nonshared = True
use_raw = True
genes_to_use = ""
clustering = "leiden"
asymmetric_matrix= True
fig = bc.tl.report(adata_pred,
celltype = "dblabel", method = "cellassign",
analysis_name = "bescaF",
train_datasets= "bescaF", test_dataset= "pbmc3k", name_report= "cellassign",
merge = merge, name_prediction = 'bescaF_label',
use_raw = use_raw, genes_to_use = genes_to_use,
remove_nonshared = False, clustering =clustering, asymmetric_matrix = True)
#fig.show()
WARNING: saving figure to file figures/umap.ondata_bescaF.png WARNING: saving figure to file figures/umap.bescaF.png Confusion matrix, without normalization Normalized confusion matrix
fig = bc.tl.report(adata_pred,
celltype = "dblabel", method = "cellassign",
analysis_name = "bescaFv2",
train_datasets= "bescaFv2", test_dataset= "pbmc3k", name_report= "cellassign_Fv2",
merge = merge, name_prediction = 'bescaFv2_label',
use_raw = use_raw, genes_to_use = genes_to_use,
remove_nonshared = False, clustering =clustering, asymmetric_matrix = True)
#fig.show()
WARNING: saving figure to file figures/umap.ondata_bescaFv2.png WARNING: saving figure to file figures/umap.bescaFv2.png Confusion matrix, without normalization Normalized confusion matrix
fig = bc.tl.report(adata_pred,
celltype = "dblabel", method = "cellassign",
analysis_name = "bescaVvs",
train_datasets= "bescaVvs", test_dataset= "pbmc3k", name_report= "cellassign_Vvs",
merge = merge, name_prediction = 'bescaVvs_label',
use_raw = use_raw, genes_to_use = genes_to_use,
remove_nonshared = False, clustering =clustering, asymmetric_matrix = True)
WARNING: saving figure to file figures/umap.ondata_bescaVvs.png WARNING: saving figure to file figures/umap.bescaVvs.png Confusion matrix, without normalization Normalized confusion matrix
fig = bc.tl.report(adata_pred,
celltype = "dblabel", method = "cellassign",
analysis_name = "cellassign",
train_datasets= "cellassign", test_dataset= "pbmc3k", name_report= "cellassign_example",
merge = merge, name_prediction = 'dblabel_cellassign',
use_raw = use_raw, genes_to_use = genes_to_use,
remove_nonshared = False, clustering =clustering, asymmetric_matrix = True)
WARNING: saving figure to file figures/umap.ondata_cellassign.png WARNING: saving figure to file figures/umap.cellassign.png Confusion matrix, without normalization Normalized confusion matrix
Besca report generate overall models. We retrieved the overall F1 and accuracy score for all reports generated in order to compare those values.
from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(".") if isfile(join(".", f)) and '_report_' in f]
f1_values = pd.DataFrame(columns=['accuracy', 'f1' ] , index=onlyfiles)
onlyfiles
['cellassign_Fv2_report_bescaFv2.csv', 'cellassign_Vvs_report_bescaF.csv', 'cellassign_Vvs_report_bescaVvs.csv', 'cellassign_example_report_cellassign.csv', 'cellassign_report_bescaF.csv']
for report in onlyfiles :
get_info = open(report, "r").readlines()[3]
dd = get_info.split(',')
f1_values.loc[report] =[dd[1], dd[3].rstrip('\n')]
f1_values
| accuracy | f1 | |
|---|---|---|
| cellassign_Fv2_report_bescaFv2.csv | 0.21845047923322683 | 0.2619440008400595 |
| cellassign_Vvs_report_bescaF.csv | 0.10982428115015974 | 0.1471481025732684 |
| cellassign_Vvs_report_bescaVvs.csv | 0.10982428115015974 | 0.1471481025732684 |
| cellassign_example_report_cellassign.csv | 0.05990415335463259 | 0.02819018981394475 |
| cellassign_report_bescaF.csv | 0.20247603833865815 | 0.2480235059887537 |
A large number of cells remain unassigned. In particular T cell assignment seems problematic.
! jupyter nbconvert --to html Cellassign_comparison.ipynb
[NbConvertApp] Converting notebook Cellassign_comparison.ipynb to html [NbConvertApp] Writing 5622644 bytes to Cellassign_comparison.html
from sinfo import sinfo
sinfo()
----- anndata 0.7.5 besca 2.3+34.gf0f50cb pandas 1.2.0 plotly 4.14.1 scanpy 1.6.0 seaborn 0.11.1 sinfo 0.3.1 ----- IPython 7.19.0 jupyter_client 6.1.7 jupyter_core 4.7.0 notebook 6.1.6 ----- Python 3.7.1 | packaged by conda-forge | (default, Mar 13 2019, 12:57:14) [GCC 7.3.0] Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core 24 logical CPU cores, x86_64 ----- Session information updated at 2021-01-06 17:33